Batch 2 - Class 266 - Introduction to Data Science III
Zoom: send meeting ID and password
Start recording
Preclass Exercise:
Pick a problem that interests you, and search the web for data sources related to it. Analyze those data sources to validate or invalidate some common hypotheses.
For classwork, you may use 750 records of the train file for training and the rest for testing in class, because the actual outcomes for the test file are not available. For homework, use the entire train file for modeling, then test by uploading the results to the website.
Form hypotheses about which factors could have led to a lower or higher survival rate on the Titanic. Also discuss why these might be plausible factors. Examples:
Gender
Cabin Class
Port of boarding
Age
Whether people were traveling in groups or alone
...
Open the Train file - this is a file that has data on about 900 passengers, with features (gender, cabin class, etc.) and whether they survived or not. We will use this data to "train" our knowledge
Using Excel Filters or Pivot Tables, find out how many people survived.
Find out what % of males survived. What % of females? What do you observe?
If you had to predict whether a person will survive or not, just by looking at their gender, what rule would you follow?
Find out if the survival rate was different across 1st, 2nd and 3rd class (PClass)
If you had to predict whether a person will survive or not, just by looking at their cabin class, what rule would you follow?
Now, find the chances of survival by combining the above factors, i.e. for each of the following groups - (Male, 1st Class), (Male, 2nd Class), (Male, 3rd Class), (Female, 1st Class), (Female, 2nd Class), (Female, 3rd Class)
What are the observations?
If you had to predict whether a person will survive or not, looking at both their gender and cabin class, what rule will you follow?
Do you think this predictor rule is better or worse than the one based only on gender or only on cabin class?
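The pivot-table steps above can also be sketched in a few lines of Python. This is only an illustration: the rows are a made-up toy sample, not the real Train file, and the helper names (survival_rates, make_rule) and the 0.5 threshold are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical toy rows of (sex, pclass, survived); NOT taken from the real Train file.
rows = [
    ("female", 1, 1), ("female", 1, 1), ("female", 3, 1), ("female", 3, 0),
    ("male", 1, 1), ("male", 1, 0), ("male", 3, 0), ("male", 3, 0),
]

def survival_rates(rows, key):
    """Return {group: fraction survived}, grouping rows with the key function."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        survived[group] += row[2]
    return {group: survived[group] / totals[group] for group in totals}

def make_rule(rates, threshold=0.5):
    """Predict 1 (survived) for groups whose survival rate exceeds the threshold."""
    return {group: int(rate > threshold) for group, rate in rates.items()}

# One factor at a time, then both factors combined.
by_gender = survival_rates(rows, key=lambda r: r[0])
by_class = survival_rates(rows, key=lambda r: r[1])
by_gender_class = survival_rates(rows, key=lambda r: (r[0], r[1]))

rule = make_rule(by_gender_class)
for group, rate in sorted(by_gender_class.items()):
    print(group, f"rate={rate:.2f}", "predict:", rule[group])
```

The same function handles one factor or several, because the grouping key is just a tuple of whichever columns are being combined. Comparing the rules built from by_gender, by_class, and by_gender_class on held-out rows is one way to answer the "better or worse" question.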
How would you check survival against each of the following?
Age: This is a continuous variable, so you may have to group it into brackets of 10 years each
Number of siblings or spouses - how would you explain the fact that people traveling with more than 2 siblings or spouses didn't survive?
Parch (parent/children) - what is the observation? What do you think will happen if we combine sibling, spouse, parent, child - how would you do that?
What do you expect on fares? Again, you may need to group them into categories. Is this an independent variable, or is it perhaps correlated with a previous one (such as cabin class)? How can we ascertain that?
Embarked - see the trend, and check if this is an independent variable or not
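As a rough sketch of the grouping ideas above: age can be bucketed into 10-year brackets, siblings/spouses and parents/children can be added into a single family-size number, and the average fare per class gives a first hint of whether fare is independent of cabin class. The passengers and field names below are hypothetical, made up for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy passengers; field names are assumed, not read from the Train file.
passengers = [
    {"age": 4.0, "sibsp": 1, "parch": 2, "pclass": 3, "fare": 20.0},
    {"age": 29.0, "sibsp": 0, "parch": 0, "pclass": 1, "fare": 80.0},
    {"age": 35.5, "sibsp": 1, "parch": 0, "pclass": 1, "fare": 60.0},
    {"age": 62.0, "sibsp": 0, "parch": 0, "pclass": 3, "fare": 8.0},
]

def age_bracket(age, width=10):
    """Map a continuous age to a label such as '30-39'."""
    lower = int(age // width) * width
    return f"{lower}-{lower + width - 1}"

def family_size(passenger):
    """Combine siblings/spouses and parents/children into one number."""
    return passenger["sibsp"] + passenger["parch"]

# Average fare per class: a large gap suggests fare is not independent of class.
fares_by_class = defaultdict(list)
for p in passengers:
    fares_by_class[p["pclass"]].append(p["fare"])
mean_fare = {pclass: mean(fares) for pclass, fares in fares_by_class.items()}

for p in passengers:
    print(age_bracket(p["age"]), "family:", family_size(p))
print("mean fare by class:", mean_fare)
```

Once each passenger has a bracket or a family size, survival rates per bracket can be computed exactly as for gender or class.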
Cleaning the data
See which variables have missing data. How would you fill those?
For example, Embarked has only 2 missing values, and one port strongly dominates, so those values can be filled with the "mode" of the field
However, "Cabin" has a lot of missing data, so perhaps we can mark missing values as NA
Age - a lot of missing data. How should we fill that? Perhaps use the salutation, and take the median age for that salutation? Or maybe take the median by sex and class?
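One possible way to apply the three filling strategies above is sketched here: mode for Embarked, an explicit "NA" marker for Cabin, and the median age of passengers with the same sex and class. The records and field names are hypothetical, and falling back to the overall median when a group has no known ages is an added assumption, not something the exercise prescribes.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical toy records; None marks a missing value.
people = [
    {"sex": "male", "pclass": 3, "age": 20.0, "embarked": "S", "cabin": None},
    {"sex": "male", "pclass": 3, "age": 30.0, "embarked": "S", "cabin": None},
    {"sex": "male", "pclass": 3, "age": None, "embarked": None, "cabin": None},
    {"sex": "female", "pclass": 1, "age": 40.0, "embarked": "C", "cabin": "C85"},
    {"sex": "female", "pclass": 1, "age": None, "embarked": "S", "cabin": None},
]

def fill_missing(people):
    """Return copies of the records with missing values filled in."""
    # Mode of Embarked over the records where it is known.
    known_ports = [p["embarked"] for p in people if p["embarked"] is not None]
    embarked_mode = Counter(known_ports).most_common(1)[0][0]

    # Known ages grouped by (sex, class), plus an overall fallback (assumption).
    ages = defaultdict(list)
    for p in people:
        if p["age"] is not None:
            ages[(p["sex"], p["pclass"])].append(p["age"])
    overall_median = median([a for group in ages.values() for a in group])

    filled = []
    for p in people:
        q = dict(p)
        if q["embarked"] is None:
            q["embarked"] = embarked_mode
        if q["cabin"] is None:
            q["cabin"] = "NA"
        if q["age"] is None:
            group_ages = ages.get((q["sex"], q["pclass"]))
            q["age"] = median(group_ages) if group_ages else overall_median
        filled.append(q)
    return filled

filled = fill_missing(people)
for record in filled:
    print(record)
```

Grouping by salutation instead would only change the key used when collecting ages.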
Testing the Predictor - Open the "Test" file. This file has feature data on passengers, but does not specify whether each of these people survived or not. Create a column "Survived" and mark every entry in it as 0 or 1, depending on whether your predictor says that person would have survived.
Create a "Submission" file. A template of the file is enclosed. To create a file for submission, create a new file. From the Test file, copy the columns labeled ID and Survived (only those two columns), and paste them into the first two columns of the new file. Save the file in CSV (comma-separated) format.
On the website mentioned, upload the submission file. Browse to and add the same file you created for both "Code File" and "Solution File". You may put any note in the description.
This will show you how accurate the prediction was. You can test different prediction criteria by creating different submission files.
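The testing and submission steps could be sketched as follows. The gender-only rule is just a placeholder predictor, the passenger records are invented, and the "ID,Survived" header follows the column labels mentioned above; the enclosed template may differ, so treat the header as an assumption. The accuracy helper mirrors the classwork setup, where the train records held back from training still have known outcomes.

```python
import csv
import io

def predict(passenger):
    """Placeholder rule only: predict survival (1) for females, 0 otherwise."""
    return 1 if passenger["sex"] == "female" else 0

def accuracy(passengers):
    """Fraction of passengers with known outcomes that the rule gets right."""
    hits = sum(predict(p) == p["survived"] for p in passengers)
    return hits / len(passengers)

def submission_csv(passengers):
    """Build the two-column submission text: ID and predicted Survived."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ID", "Survived"])  # assumed header; check the template
    for p in passengers:
        writer.writerow([p["id"], predict(p)])
    return buffer.getvalue()

# Hypothetical held-out train records (outcome known) and test records (unknown).
held_out = [
    {"id": 751, "sex": "female", "survived": 1},
    {"id": 752, "sex": "male", "survived": 0},
    {"id": 753, "sex": "male", "survived": 1},
    {"id": 754, "sex": "female", "survived": 1},
]
test_rows = [{"id": 1001, "sex": "female"}, {"id": 1002, "sex": "male"}]

print("held-out accuracy:", accuracy(held_out))
print(submission_csv(test_rows))
```

To produce an actual file, the returned text can be written to a .csv file with the built-in open function.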
Types of predictors
Linear predictor: We could assign a certain number of points to each feature, and then add up the points for the features in a particular row to get a total score for the prediction
Why is this good?
What is the limitation?
Non-linear predictors: These can essentially customize the point system depending on the values of earlier variables - like the tree we drew above
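To make the contrast concrete, here is a minimal sketch of both kinds of predictor. The point values, the threshold, and the tree's branches are invented for illustration; they are not derived from the Titanic data or from the tree drawn in class.

```python
# Made-up point values for illustration; a real predictor would tune these on the Train file.
SEX_POINTS = {"female": 2, "male": 0}
CLASS_POINTS = {1: 1, 2: 1, 3: 0}
THRESHOLD = 3

def linear_predict(sex, pclass):
    """Linear rule: each feature adds a fixed number of points, whatever the other features are."""
    score = SEX_POINTS[sex] + CLASS_POINTS[pclass]
    return int(score >= THRESHOLD)

def tree_predict(sex, pclass, age):
    """Tree rule: the question asked next depends on the answer to the previous one."""
    if sex == "female":
        # In this branch, cabin class decides.
        return 0 if pclass == 3 else 1
    # In the male branch, a different feature (age) decides.
    return 1 if age < 10 else 0

for sex, pclass, age in [("female", 1, 30), ("female", 3, 30), ("male", 1, 30), ("male", 3, 5)]:
    print(sex, pclass, age, "linear:", linear_predict(sex, pclass), "tree:", tree_predict(sex, pclass, age))
```

The linear rule is easy to explain and adjust because each feature's contribution is fixed, but for the same reason it cannot let a feature matter in one group and not in another. The tree can, at the cost of more branches to choose and check.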
Homework:
For the Titanic problem, explore correlations with parameters other than gender and class, and come up with a better predictor. Upload the result to the website to test the prediction accuracy